February 20, 2025
Neural networks are a powerful framework for machine learning!
Start with an initial set of features
Create a mapping of the initial feature space to a \(K_1\) dimensional space where each dimension of the hidden space corresponds to some subfunction of the features
These subfunctions are built from the original features,
just organized in a meaningful way that allows a classifier to better separate the classes
Fully connected NNs have the following form:
Each layer takes an input, \(\mathbf x_i\) and translates the input to a \(K\) dimensional space through a linear mapping. For dimension \(k \in K\):
\[\underset{1 \times 1}{\mathbf q_{ik}} = \underset{1 \times K_{-1}}{\mathbf W_k^T} \underset{K_{-1} \times 1}{\mathbf x_i} + \underset{1 \times 1}{b_{k}}\]
Each value is then passed through an activation function to induce nonlinearity in the dimension mapping:
\[\underset{1 \times 1}{h_{ik}} = \underset{1 \times 1}{\max(0, q_{ik})}\]
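The two steps above - a linear map followed by a ReLU - can be sketched in NumPy (the dimensions and variable names here are illustrative, not from the lecture):

```python
import numpy as np

# A minimal sketch of one fully connected layer with K_prev input
# features and K hidden units.
rng = np.random.default_rng(0)
K_prev, K = 4, 3
W = rng.normal(size=(K_prev, K))   # one weight vector w_k per column
b = rng.normal(size=K)             # one bias b_k per hidden unit
x_i = rng.normal(size=K_prev)      # a single input vector

q_i = W.T @ x_i + b                # linear map: q_ik = w_k^T x_i + b_k
h_i = np.maximum(0.0, q_i)         # ReLU activation, elementwise

print(h_i.shape)  # (3,)
```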
Repeat this mapping for each hidden layer.
At the final step, create a linear classifier in the altered feature space!
Can’t fully display the final 128 dimensional space
Flattening works okay for this data set.
Generically, this isn’t ideal though!
Images have spatial information; the values of pixels in a neighborhood of the image are informative for each other!
Let the hidden scores for \(i\) at a layer be denoted by \(\mathbf h_i \in \mathbb R^K\). For element \(k \in K\) in the first hidden layer, we can compute \(h_{ik}\) as:
\[h_{ik} = \varphi(\mathbf w_k^T \mathbf x_i)\]
For zero-centered values, this is equivalent to the covariance between the two vectors!
And we’ll only see positive covariance because of the ReLU nonlinearity!
Each \(\mathbf w_k\) denotes a template for a part of the image.
Then computing the inner product of \(\mathbf x_i\) and \(\mathbf w_k\) shows the similarity between the \(i^{th}\) input and the template
If the similarity is big, large positive value
If the similarity is small, 0 or small positive value
Each of these images has the three prong feature!
The problem is that our weight matrix does not exhibit translational invariance
How do we encode colors in an image?
Each pixel is a vector of 3 values between 0 and 255: Red (R), Green (G), and Blue (B)
Combined, we can create basically any color!
RGB images present another problem with using FCNNs for analyzing images: number of parameters
Let’s say that we have 100k 28 x 28 images (relatively low resolution) with 3 color channels
Our entire feature set now has 235 million features!
The first layer with 128 hidden units would estimate 301 thousand coefficients!
That’s too many…
To get around these problems, ML researchers introduced convolutional neural networks
The first few layers of a CNN are convolutional layers that divide the input into small overlapping image patches - collections of pixels in a small neighborhood.
Each image patch is then compared to \(K\) templates - small weight matrices (think \(3 \times 3\) or \(5 \times 5\)) that are learned from the data that represent different patterns within each image.
Each template of \(5 \times 5\) weights is convolved with the input image - slid over all possible locations in the original 2D or 3D image space - to produce a new feature map after being passed through a ReLU transform.
Key point 1:
Convolution operations will yield translational invariance
This is key for image classification!
Key point 2:
A convolution operation has very few parameters compared to a FCNN!
For a single hidden unit with a \(28 \times 28 \times 3\) image, we need to compute 2352 weights!
For a \(5 \times 5 \times 3\) filter slid over the \(28 \times 28 \times 3\) image, we only have 75 parameters!
In the machine learning literature, the convolution operation is actually the cross-correlation operation.
1D Convolution
\[[1,2,3,4,5,6] \circledast [1,2] = [5,8,11,14,17]\]
1D Convolution
\[[\boxed{\color{red}1,2},3,4,5,6] \circledast [\boxed{\color{red}1,2}] = [\boxed{\color{red}5}, 8, 11,14,17]\]
\[(1*1) + (2*2) = 5\]
1D Convolution
\[[1,\boxed{\color{red}2,3},4,5,6] \circledast [\boxed{\color{red}1,2}] = [5, \boxed{\color{red}8}, 11,14,17]\]
\[(2*1) + (3*2) = 8\]
1D Convolution
\[[1,2,\boxed{\color{red}3,4},5,6] \circledast [\boxed{\color{red}1,2}] = [5, 8, \boxed{\color{red}11},14,17]\]
\[(3*1) + (4*2) = 11\]
\[[1,2,3,4,5,6] \circledast [1,2] = [5,8,11,14,17]\]
The convolution of a 6-vector with a 2-vector yields a 5-vector
In general, a convolution of a larger vector with a smaller filter will result in a smaller output!
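As a quick check, NumPy's `np.correlate` computes exactly this valid-mode cross-correlation:

```python
import numpy as np

# The 1D "convolution" above is NumPy's valid-mode cross-correlation:
# the filter slides over the vector with no flipping.
x = np.array([1, 2, 3, 4, 5, 6])
w = np.array([1, 2])
out = np.correlate(x, w, mode="valid")
print(out)  # [ 5  8 11 14 17]
```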
2D Convolution
Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).
\[\begin{bmatrix} \boxed{a_{11}} & \boxed{a_{12}} & a_{13} \\ \boxed{a_{21}} & \boxed{a_{22}} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]The top-left element of \(C\) is computed as follows:
\[ c_{11} = a_{11}w_{11} + a_{12}w_{12} + a_{21}w_{21} + a_{22}w_{22} \]
2D Convolution
Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).
\[\begin{bmatrix} a_{11} & \boxed{a_{12}} & \boxed{a_{13}} \\ a_{21} & \boxed{a_{22}} & \boxed{a_{23}} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]The next element of \(C\) is computed as follows:
\[ c_{12} = a_{12}w_{11} + a_{13}w_{12} + a_{22}w_{21} + a_{23}w_{22} \]
2D Convolution
Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).
\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ \boxed{a_{21}} & \boxed{a_{22}} & a_{23} \\ \boxed{a_{31}} & \boxed{a_{32}} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]The next element of \(C\) is computed as follows:
\[ c_{21} = a_{21}w_{11} + a_{22}w_{12} + a_{31}w_{21} + a_{32}w_{22} \]
2D Convolution
Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).
\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & \boxed{a_{22}} & \boxed{a_{23}} \\ a_{31} & \boxed{a_{32}} & \boxed{a_{33}} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]The next element of \(C\) is computed as follows:
\[ c_{22} = a_{22}w_{11} + a_{23}w_{12} + a_{32}w_{21} + a_{33}w_{22} \]
All we’re doing is sliding the kernel matrix over the image to get some summary of the patches!
As we showed above, this is computing the cross correlation of the filter with the segment of the image!
If the value is high, then the cross-correlation of the kernel with the image segment is also high
Tells us if the feature is present in the patch or not.
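The sliding-window computation above can be sketched as explicit loops (a minimal illustration, not an efficient implementation; the function name is made up):

```python
import numpy as np

# 2D valid cross-correlation, as used in ML "convolution" layers.
def conv2d_valid(A, W):
    n_h, n_w = A.shape
    f_h, f_w = W.shape
    C = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            # elementwise product of the filter with the current patch
            C[i, j] = np.sum(A[i:i + f_h, j:j + f_w] * W)
    return C

A = np.arange(1, 10).reshape(3, 3)   # 3x3 input
W = np.ones((2, 2))                  # 2x2 filter of ones
print(conv2d_valid(A, W))            # each entry sums a 2x2 patch
```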
Before we show this off, I should mention one thing about convolutions and the resulting size
In the 2D case with a \(3 \times 3\) image and a \(2 \times 2\) filter, we ended up with a \(2 \times 2\) convolution.
Try to find the following sizes yourself using only valid configurations:
\(4 \times 4\) image and a \(2 \times 2\) filter
\(4 \times 4\) image and a \(3 \times 3\) filter
\(4 \times 3\) image and a \(2 \times 2\) filter
Can you come up with the general rule?
For a \(n_h \times n_w\) image and a \(f_h \times f_w\) filter, we get a:
\[(n_h - f_h + 1) \times (n_w - f_w + 1)\]
convolution.
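The rule can be sanity-checked against the three exercises with a tiny helper (the function name is made up for illustration):

```python
# Valid-convolution output size: (n_h - f_h + 1) x (n_w - f_w + 1).
def conv_out_size(n_h, n_w, f_h, f_w):
    return (n_h - f_h + 1, n_w - f_w + 1)

print(conv_out_size(4, 4, 2, 2))  # (3, 3)
print(conv_out_size(4, 4, 3, 3))  # (2, 2)
print(conv_out_size(4, 3, 2, 2))  # (3, 2)
```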
To yield a same convolution (the output is the same size as the input), we can use zero-padding to make the edges of the image wider without injecting new pixel information.
2D Convolution
Let’s consider a 3x3 matrix \(\mathbf A\), zero-padded to 5x5, and a 3x3 weight matrix \(\mathbf W\).
\[\begin{bmatrix} \boxed{0} & \boxed{0} & \boxed{0} & 0 & 0 \\ \boxed{0} & \boxed{a_{11}} & \boxed{a_{12}} & a_{13} & 0 \\ \boxed{0} & \boxed{a_{21}} & \boxed{a_{22}} & a_{23} & 0 \\ 0 & a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ \end{bmatrix} = \mathbf C\]If we zero pad \(p_h\) rows to the top and bottom of the image and \(p_w\) columns to the left and right of the image, then the resulting convolution will have size:
\[(n_h + 2p_h - f_h + 1) \times (n_w + 2p_w - f_w + 1)\]
Assuming a square filter and setting \(p_h = p_w\), what should we set the pad width to such that the convolution is of the same size as the input image?
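Setting the output size equal to the input size and solving for \(p\):

\[n + 2p - f + 1 = n \implies p = \frac{f - 1}{2}\]

which is an integer whenever the filter size \(f\) is odd - one reason odd-sized filters are so common.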
With a same convolution and an odd-sized kernel, we can think of each element of the resulting feature map as the kernel centered on the corresponding pixel of the image.
Let’s look back at some threes
Let’s create a \(5 \times 5\) filter to try and find the left-hand prongs of the threes:
\[ \begin{bmatrix} 0 & 0 & -1 & 0 & 0 \\ 0 & -1 & -2 & -1 & 0 \\ -1 & -2 & 16 & -2 & -1 \\ 0 & -1 & -2 & -1 & 0 \\ 0 & 0 & -1 & 0 & 0 \end{bmatrix} \]
Using one filter produces one feature map
This one doesn’t work too well
It finds the edges of a three quite well!
It’s probably better if we try to learn the weights of the filter from the data!
What if we want more than 1 feature map?
A convolutional layer is a layer in a DNN where we perform \(K\) convolutions with \(K\) filters
A common small version of a CNN is called LeNet5Small
Given a set of weights for the kernel, \(\mathbf W_k\), and the corresponding bias term, \(b_k\), we can compute the feature map for a given image as:
\[\mathbf W_k \circledast \mathbf X_i + b_k\]
Each convolutional layer can have multiple filters, so we can look at each feature map separately
6 here
For grayscale images (1 color channel), the first layer of convolutional filters (after being passed through a ReLU activation) tends to learn oriented edges and outlines!
And way fewer parameters!
The total convolution weight matrix is a \(5 \times 5 \times 6\) matrix
Compare this to a fully connected layer with 6 hidden units
For each filter (of size \(5 \times 5\) in this case), there are learnable parameters that dictate the convolution
Choose parameters that describe important features within the images
Convolutions that capture the variation most important in dictating the class of the image!
These parameters (convolution weights) are learned via backpropagation in the same way that other weights are learned!
The backprop update steps are similar to those for FCNNs
We aren’t going to go over these as it is largely redundant with our previous lectures
See here for a pretty good summary of how backprop works for the convolutional layers
Let’s look at the values for each “pixel” in the feature map for an image
The neighboring values are largely redundant
The values don’t change very much from neighbor to neighbor
Small drops
The reason for this is that the receptive field for each pixel when computing the convolution is largely shared neighbor to neighbor
Draw a picture on the board
One method of reducing this redundancy is to perform a strided convolution
Instead of centering the filter at every pixel, only place it at every \(s^{th}\) position
We’re still getting most of the information, just not doing every possible filter location!
An added benefit of strided convolution is that we shrink the size of the feature map for each image.
With same convolution via zero padding, we get a \(n_h \times n_w\) feature map.
For a \(f \times f\) filter, even zero-padding of \(p\) on each side, and even striding on the rows and columns (every \(s\) elements), we get:
\[\frac{n_h + 2p - f + s}{s} \times \frac{n_w + 2p - f + s}{s}\]
Strided convolutions are faster than dense convolutions
Cost is loss of resolution in the feature map
With strided convolutions, we don’t really lose all that much!
The cost is a little accuracy.
Striding can be really useful when the input images are big
The most popular way to deal with redundancies and reduce the dimensionality of the resulting feature map is to follow up each convolutional layer with a pooling layer
In each pooling layer, we look at each \(K \times K\) window of the feature map and return a single number summary within the filter.
Two common types:
Max pooling - return the maximum value within the window
Average Pooling - return the average value within the window
Max pooling:
\[\begin{bmatrix} \boxed{1} & \boxed{2} & 3 \\ \boxed{4} & \boxed{5} & 6 \\ 7 & 8 & 9\\ \end{bmatrix} = \begin{bmatrix} \boxed{5} & 6\\ 8 & 9\\ \end{bmatrix}\]Max pooling takes the strongest signal within each window of the feature map
Max pooling also reduces the dimensionality of the resulting feature map
Pooling is most commonly done with a window of size \(K \times K\) and a stride of \(s\)
When \(s = K\), we divide the image into non-overlapping partitions and take the pool
Note that there are no parameters here!
Suppose that our feature map is of size \(H \times W\)
The output is then a reduced feature map with size:
\[\left(\frac{H - K}{s} + 1\right) \times \left(\frac{W - K}{s} + 1\right)\]
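A minimal sketch of max pooling with a \(K \times K\) window and stride \(s\) (the function name is illustrative):

```python
import numpy as np

# Max pooling: take the largest value within each K x K window,
# moving the window s elements at a time.
def max_pool(A, K, s):
    H, W = A.shape
    out = np.zeros(((H - K) // s + 1, (W - K) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = A[i * s:i * s + K, j * s:j * s + K].max()
    return out

A = np.arange(1, 10).reshape(3, 3)
print(max_pool(A, K=2, s=1))  # the 2x2 result from the example above
```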
Let’s look back at (small) LeNet5
We can see that max pooling:
Reduces the size and resolution of each feature map
Accentuates the activated pixels
Pooling layers are critical for scalable implementations of CNNs!
After an initial convolution + pooling layer, each image has \(K\) feature maps of size \(H \times W\)
Given \(C\) pooled feature maps of size \(H \times W\) for \(N\) images, we can express the new set of “images” as a new image with \(C\) channels
\[C \times H \times W\]
Why stop at 1 convolutional layer?
Create a convolution over the new “image”
Define a filter that looks at a \(K \times K\) patch in each feature map and across feature maps
Next filter will be \(C \times K \times K\)
The depth of a filter should always be equal to the number of input channels!
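A sketch of a single multi-channel filter (names illustrative), showing how the filter sums across channels at each spatial position to give one 2D feature map:

```python
import numpy as np

# A C x K x K filter slid over a C-channel input: at each position we
# multiply elementwise across ALL channels and sum to one number.
def conv2d_multichannel(X, W):
    C, H, Wd = X.shape
    _, K, _ = W.shape            # filter depth must equal C
    out = np.zeros((H - K + 1, Wd - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[:, i:i + K, j:j + K] * W)
    return out

X = np.ones((3, 5, 5))           # 3-channel 5x5 "image"
W = np.ones((3, 2, 2))           # one 3 x 2 x 2 filter
print(conv2d_multichannel(X, W).shape)  # (4, 4)
```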
Seems more complicated, but it isn’t really!
Really cool visualization of this process:
Modern CNN architectures have the following form:
Input images (1 or 3 color channels)
Convolution + ReLU + Pool (reduce the feature map dimensionality) a few times
Flatten (take the 2d image and flatten it to a row like we’ve done previously)
FCNN (1 or 2 layers, typically)
Output layer
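As an illustration, we can trace the shapes through a hypothetical LeNet-style stack using the size formulas from earlier (the specific layer choices here are assumptions, not the exact architecture):

```python
# General output-size rule for one spatial dimension:
# (n + 2p - f) // s + 1
def conv_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

n = 28                       # 28 x 28 grayscale input
n = conv_size(n, f=5, p=2)   # conv 5x5, same padding -> 28
n = conv_size(n, f=2, s=2)   # max pool 2x2, stride 2 -> 14
n = conv_size(n, f=5)        # conv 5x5, valid        -> 10
n = conv_size(n, f=2, s=2)   # max pool 2x2, stride 2 -> 5
flat = 16 * n * n            # flatten 16 feature maps
print(flat)                  # 400 inputs to the FC layers
```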
LeNet-5 is the seminal CNN architecture designed for the MNIST handwritten digits (1998!!!!)
A quick note on computational complexity:
The most expensive part of a CNN is the convolution part!
Let’s suppose that we have a single \(3 \times 32 \times 32\) input image and we want to convolve it with 10 \(5 \times 5\) filters - each filter spans all 3 color channels!
We pad the image to have a same convolution:
Output volume: \(10 \times 32 \times 32\)
Number of parameters: \(10 \times 3 \times 5 \times 5 = 750\) weights plus 10 bias params
Sliding over the entire image, we get 1024 filter placements.
Each placement requires 75 multiply-adds
\(75 \times 1024 \times 10 = 768,000\) ops per layer!
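Checking the arithmetic:

```python
# 75 multiply-adds per filter placement, 32 x 32 placements, 10 filters.
per_placement = 5 * 5 * 3        # one 5x5x3 filter application
placements = 32 * 32             # same convolution over a 32x32 image
ops = per_placement * placements * 10
print(ops)  # 768000
```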
This adds up!
GPUs are key to making CNNs move at a respectable pace for modern sized image analysis problems!
Let’s look at the real LeNet5.
Subsample of 20,000 \(28 \times 28\) images in 10 classes
Takes around 20 seconds to train!
More operations, but simpler operations
Leads to faster compute time than the worse-performing FCNN!
Faster training convergence
Better validation accuracy!
Next time: